On the nature of 
Phase- Type Poisson distributions 

Sophie Hautphenne* Guy Latouche"^ Giang T. Nguyen* 



Matrix-form Poisson probability distributions were recently intro- 
duced as one matrix generalization of Panjer distributions. We show in 
this paper that under the constraint that their representation is to be 
nonnegative, they have a physical interpretation as extensions of PH 
distributions, and we name this restricted family Phase-type Poisson. 
We use our physical interpretation to construct an EM algorithm-based 
estimation procedure. 
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1 Introduction 

First appeared in Panjer (1981), Panjer's algorithm is designed to compute 
efficiently the density of sums of the form S = J2 1<i<N Aj, where the AjS are 
i.i.d. positive random variables and iV is random, with a density {p n } that 
follows the recurrence relation 



Po being such that ^2 n>0 p n = ^- ^ the ^ s are nonne gative integer- valued 
random variables with density {/„}, then the density {g n } of S may be 
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for n > 1, 




recursively computed as 



9o = Po, 9n= ^2 fi9n-i(a + ib/n) for n > 1. (2) 

l<i<n 

This is a very efficient procedure, which has excellent numerical stability 
properties. 

The distributions that satisfy Q belong to a restricted set of families con- 
sisting of Poisson, binomial and negative binomial distributions (see Sundt 
and Jewell (1981)). Much effort has been spent to extend Panjer's algo- 
rithm to other distributions for N. In particular, its extension to Phase-type 
(PH) distributions is of great interest: since they are dense in the class of 
distributions on N, this significantly increases the applicability of Panjer's 
algorithm. 

Phase-type distributions have been introduced by Neuts (1975) and (1981) 
and they may be defined algebraically as follows: consider a sub-stochastic 
matrix T of order m such that / — T is nonsingular, a density vector a of 
order m, and define a sequence {v n } of row vectors with 

Vi = a(I - T), v n = v n _{T for n > 2. (3) 

The density po = 1 — al, p n = v n l, for n > 1, where 1 is a column vector 
of ones, is said to be of phase-type, with representation (a,T). There is a 
clear similarity between Q and (|3|, which suggests that the recursion ^ 
might be adapted to provide an efficient and numerically stable algorithm to 
compute the density of S when iV has a PH distribution. This is done in two 
recent papers, Wu and Li (2010) and Siaw et al. (2011). The former defines 
the generalized (a, b, 0) family as 

p n = jP n l for n > 0, (4) 

where the matrices {P n } of order m are recursively defined as follows: 

P n = P n - X {A + -B) for n > 1. (5) 
n 

The parameters are the matrices A, B, P and the vector 7, which is assumed 
to be nonnegative and normalized, so that 7I = 1. 

Siaw et al. (2011) define the generalized (a, 6,1) family, the difference 
being that the recursion ^ starts at n = 2, and the parameters are A, B, Pi 
and po, while the matrix P becomes irrelevant. The PH(ck, T) distribution 
belongs to the generalized (a, b, 1) family, with A = T, B = 0, po = 1 — al, 
7 = (alj^a and Pi = (al)(7 - T). 
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The core of the algorithm in Wu and Li (2010) and Siaw et al. (2011) is 
the vector recursion 

h n = V f t h n ^{A+-B) (6) 

l<i<n 

to replace (|2]), with g n = h n l. Ren (2010) gives an improved algorithm in 
case N and the X^s themselves are of phase-type. Finally, we note that PH 
distributions have rational generating functions, and this is the basis for the 
adaptation in Eisele (2006) of Panjer's algorithm to the case where N is PH. 
A comparison of the complexity and numerical stability of the algorithms in 
Eisele (2006), Ren (2010), Wu and Li (2010) and Siaw et al. (2011) is outside 
the scope of the present paper. 

We expect the generalized (a, b, 0) and (a, b, 1) distributions to form a 
very rich family since they include the PH distributions. However, as we 
show in the next section, the combination of two matrices in ^ makes these 
distributions a bit unwieldy, unless one imposes some simplifying constraint. 
In Section [2| we show that the series X^n>o ^ n * s a key quantity and that, for 
all practical purpose, it is necessary that the spectral radius of A be strictly 
less than one in order for the series to converge. Before doing so, we briefly 
address the issue of the choice of representation, and we adopt one that is 
slightly different from the representation in Wu and Li (2010) and Siaw et 
al. (2011). 

Next, we assume in Section |3]that A and B commute. As matrices go, this 
is a very strong constraint, but it considerably simplifies the determination 
of the generating function and of moments, and it is a property of all the 
examples in Wu and Li (2010) and Siaw et al. (2011). In Section |4j we focus 
our attention on distributions for which A = 0, B > 0, and 7 > 0. These 
distributions are interesting because they form a family totally distinct from 
PH distributions, yet they are amenable to a Markovian representation. For 
that reason, we call them Phase-type Poisson or PH-Poisson distributions. 
This physical interpretation opens the way in Section [5] to an estimation 
procedure based on the EM algorithm. 

2 Matrix generating function 

We are concerned with distributions {p n } defined as 

p n = f3P n l, where P n = TT (A + -B) for n > 0, (7) 

Ki<n 



3 



A and B are matrices of order m, and (3 is a row vector of size m. We use the 
convention that for n = 0, the matrix product in ([7]) is equal to the identity 
matrix, so that we may recursively define the P n s as 

P = I, P n = P n -i(A + -B) for n > 1. (8) 

n 

We shall write that {p n } has the representation P(/3, A, 5) of order m. 

This definition calls for a few comments. First, we assume that the re- 
cursion ([7]) starts with n — 0. In other words, we are not concerned in this 
paper with the possibility that the sequence {p n } does not conform to the 
general pattern for small values of n. Instead, we focus our attention, to a 
large extent, on the matrices P n . 

Second, our definition is slightly different from that of generalized (a, b, 0) 
distributions in Wu and Li (2010), where it is assumed that 7 is a stochastic 
vector (7 > 0, 7I = 1) and that Pq is a matrix chosen according to the 
circumstances. The two representations are equivalent as it suffices to define 
(3 = -yPo- Our reason to prefer ^ is that we do not find any advantage in 
requiring that (3 should be stochastic when A, B and Pq are allowed to be of 
mixed signs. Furthermore, our definition involves m 2 fewer parameters (the 
entries of Pq) and this savings will prove significant in Section [5] when we 
design an estimation procedure. 

Finally, one might use left- instead of right-multiplication and define P n = 
(A + -B)P n _i, yielding a possibly different family of distributions. Actually, 
we shall assume in the next section that A and B commute, so that there 
would be no difference. 

We need to impose some constraints on the representations of these dis- 
tributions, otherwise very little can be said in general. To begin with, let us 
associate a transition graph to the matrices A and B: the graph contains m 
nodes, and there is an oriented arc from % to j if \Aij \ + \B^\ 7^ 0. A node j is 
said to be useful if there exists a node i such that there is a path from i to j 
in the transition graph and such that fa 7^ 0; j is said to be useless otherwise. 
The lemma below shows that one may require without loss of generality that 
representations are chosen without useless nodes. 

Lemma 2.1 If the representation T>({3, A, B) of order m is such that there 
exists at least one useless node, then there exists another, equivalent, repre- 
sentation of order m' strictly less than m. 

Proof Assume that j is a useless node; define Si to be the subset of nodes 
containing j and all the nodes i for which there exists a path from % to j. 
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The matrices A and B may be written, possibly after a permutation of rows 
and columns, as 



A 



^1,1 




A 



and 



B 



2,2 



Bi,i 




Pl,2 
p2,2 



where A^i and Pii are indexed by the nodes in Si and A 22 and i?2,2 are 
indexed by the remaining nodes; similarly, we have 



(Pn) 1,1 




with (P n )i,i = rWJAi + fSi,i) and (P n ) 2 , 2 = Yli<i< n ( A W + \B 2y2 ). 

We partition f3 in a similar manner and write /3 = \J3 1 /3 2 ] • Since j is 
useless, f3 1 = 0. It is clear that f3P n l = f3 2 (P n ) 2t2 l, so that V(f3 2 , A 2j2 , -82,2) 
is an equivalent representation, of order strictly smaller than m. □ 

The generating function p(z) = ^2 n>0 z n p n may be written as p(z) = 
(3P(z; A,B)1, where ^ 



P(z; A,B) = J2 z n Pn = J2 zU ]\(A+\b), 



(9) 



n>0 



n>0 



KKn 



provided that the series in (|9| converges. We focus our attention on the ma- 
trix generating function P(z; A, B) and we discuss its convergence properties 
as z — >■ 1. A simple condition for P(l; A, B) to be finite is given in the next 
lemma. 



Lemma 2.2 If sp(A) < 1, where sp(-) denotes the spectral radius, then the 
series P(z;A,B) converges for \z\ < 1. 

If A > and P > ; £/ien £/ie inequality sp(A) < 1 is both necessary and 
sufficient. 

Proof The convergence radius R of the series in ^ is given by P _1 = 
limsup n ^/||P n ||, where || ■ || is any matrix norm. To simplify the notations, we 
define d = A+(l/i)B. For any consistent norm, ||P n || < ||Ci|| 1 1 C2 1 1 • • • ||Cn||- 
Furthermore, \\Ci\\ < \\A\\ + (l/z)||P|| and for any e > 0, there exists a norm 
such that \\A\\ < sp(A) + e. 

This implies that if sp(A) < 1, then there exist rj < 1 and i* such that 
||Cj|| < 77 for all % > i*. In addition, 

\\p n \\ 1/n < \\CiC 2 • • • a* n 1/n (iic^+iii • • • naii) 1/n 

< ||CiC , 2---c i *H 1/ V ri " i * )/n , 
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for n > i*, and HC1C2 • • • Ci* ^■f n rf n — > rj as n — > 00. We conclude, 



therefore, that limsup n P n \\ < rj and R > 1/rj > 1, which proves the first 
claim. 

If A and P are non-negative, then P n > A n and P(l; A, P) > Xl„>o 
since the last series diverges if sp(A) > 1, this completes the proof of the 
second claim. □ 

Note that sp(A) < 1 cannot be a necessary condition in all generality: to 
give one example, if there is some n* such that P n = for all n > n*, then 
the series in ^ reduces to a finite sum, and the spectral radius of A has no 
bearing on its convergence; such is the case if B = —n*A. 

We turn our attention to the derivatives 



M n (A,B)= ^P(z;A,B) 



(10) 



z=l 

for n > 1, assuming that they exist. In that case, the factorial moments of the 



distribution are given by m n = {3M n (A, B)l. From the proof of Lemma 2.2 



if sp(v4) < 1, then P(z; A, B) is a matrix of analytic functions in the closed 
unit disk, and it is a sufficient condition for the derivatives to be finite at 
z = l. 

Lemma 2.3 The matrices M n (A,B) are given by 

M n (A,B) = n\P n P(l;A,nA + B). (11) 
If sp(A) < 1, then we also have 

M n (A,B) = n\P(l;A,B)P n , (12) 



where 



Pn= J! ((A+lB^I-A)- 1 ). 



Ki<n 



Proof We write 

P n = ^l[(iA + B) = ^(A + B) J] (iA + {A + B)) 

' Ki<n ' ' KKn-1 



kA + B) n (A + hA + B)) 



n 

Kt<n-1 



so that 



P(z;A,B) = J2 nzn ~ lp n = (A + B)^^ 1 H (A+-(A + B)) 

n>l n>l l<i<n-l % 

= (A + B)P(z;A,A + B) 
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and, by induction 

a 



dz> 



P{z\ A, B) = (A + B)(2A + B)---(nA + B)P(z; A, nA + B) 



n\P n P(z; A,nA 



B) 



for all n, from which (11) results. 



On the other hand, Lemma 1 in Wu and Li (2010) states that 
j-P(z; A, B) = z^-P{z- A B)A + P(z; A, B)(A + B). 



If sp(A) < 1, then 

d_ 

dz 



P(z\ A, B) = P{z; A, B){A + B)(I - zA)' 1 



for \z\ < 1, from which (12) readily results by induction. 



□ 



This lemma points to the importance of being able to determine the 
matrix P(l; A, B). In some special cases, an explicit expression may be 
derived but in general, in the absence of any simplifying feature of the pair 
(A, B), there does not seem to be an alternative to the brute force calculation 
of the series £ n>0 Ui<i<n-M + \ B )- 



3 Commutative matrix product 

In this section, we assume that A and B commute and thereby obtain a 
stronger result than in Section [2} This assumption is satisfied for all examples 
in Wu and Li (2010), where either A = or B is a scalar multiple of A. It is 
also satisfied if A or B is a scalar matrix cl for some scalar c, or if B = 0. The 
latter includes PH(ck,T) distributions if there exists a solution to the system 
of linear constraints ot(I — T)T n_1 l = f3A n l for n > 1; if T is invertible, 
then an obvious solution is f3 — o.(I — T)T _1 , A = T. 

Thus, although it is a restrictive assumption from a linear algebraic point 
of view, it may be reasonable in the context of stochastic modeling. 

Theorem 3.1 If A and B commute, then P(z; A, B) = e ( A + B ) D ( z < A ) } where 
D{z ] A) = zY,\{zA) n - 1 

n>l 

= A^ 1 log(J — zA)^ 1 if A is nonsingular. 

Furthermore, if B = —kA for some integer k > 1, then P(z;A,B) = (I — 
zA) 1 '" 1 and P(l; A, B) is finite, otherwise, P(1;A,B) converges if and only 
ifsp(A) < 1. 



7 



Proof First, we observe that 



n (*a+s)= yi 



Ki<n 



0<i<n 



(A + B) i A n ~ i 



(13) 



where m are Stirling's numbers of the first kind. If A and B are scalars, 
then (13) is a straightforward consequence of the definition of Stirling's num- 
bers in Knuth (1968), Section 1.2.6, equation (40). To prove the extension 
to commuting matrices, one proceeds by induction, using 



0. 



and 



n 




n 




~n + I 




+ n 








% - 1 




i 




i 



for n > 1. Next, we write 

P(*;4B) = £*4 £ 



fc>0 



0<i<fc 



(A + BYA*-* byg[T3[ (14) 



i>0 k>i 



[zAf 



(15) 



since, as we show later, we may interchange the order of summation. By 
equations (25) and (26) in Knuth (1968), Section 1.2.9, 



k>i 



X 



k—i 



k>l 



X 



k-l\i 



(x logfl — x) 



and (15) becomes 



P(z; A,B) = ^2 - } ((A + B)D(z; A))\ 



i>0 



which proves the first claim. 
If B = -kA, then 



P(z;A,B) 



»(wME„>iM""V' 



,(fc-l)log(I-zA) 



(/ - zA) 



k-1 



Thus, it remains for us to justify the transition from (14) to (15). To that end, 
we show that the series is absolutely convergent if and only if sp(A) < 1. It 
is well-known that \\A k \\ = 0(l)sp(A) h k r asymptotically as k — > oo, for some 
integer r > 0. Furthermore, [J]/jfe! = 0(l)(logjfe) <-1 /(i - 1)!, by Theorem 1 
in Wilf (1993). Therefore, 



lim 

k— >oo 



1 

k\ 



\\A k ~ 



sp(A) lim y/k r (log ky- 1 = sp(A) 



k— >oo 
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so that the series Y2k>i [J C 2 ^)* * * n absolutely converges in ||^|| < 1 
if and only if sp(A) < 1, in which case its limit is \(D(z; A)/z) 1 . The 
equation (15) becomes 

P(z;A,B) = J2l((A + B)D(z;A)y 

i>0 

which converges without further constraint. □ 

This theorem confirms the important role of the matrix A with respect 
to the convergence of various series. A direct consequence is that if A, B\ 
and _E?2 are three commuting matrices, then 

P(Z;A,B 1 )P( Z ;A,B 2 ) = e (A+B 1 )D( Z ;A) e (A+B 2 )D( Z ;A) = e (2A + B 1+ B 2 )D( Z ;A) 

= P(z;A,A + B t + B 2 ), (16) 

so that, if A and B commute, we may write that 

P{z; A, kA + B) = P{z; A, B)(P{z; A, 0)) k = P(z; A, B)(e AD(z ' A) ) k 
= P{z;A,B)(I - zA)- k 



for k > 0, k integer, and we may state the following property, using either (11) 
or (Q: 

Corollary 3.2 If A and B commute, then the nth factorial moment of the 
distribution is given by 

m n ((3, A, B) = n!/3P(l; A, B)(I - A)- n P n l. (17) 

□ 

If one remembers that /3P(1; A,B) is a vector of which the components 
add-up to one, the similarity with the factorial moments of discrete PH distri- 
butions is striking (see equation (2.15) of Latouche and Ramaswami (1999)). 

To conclude this section, we review the examples in Wu and Li (2010): 

• If B = aA for a > -1, then (A + B)D(z; A) = (1 + a) log(J - zA)- 1 
and P(z; A, aA) = (I - zA)^ 1+a \ 

• If B = —kA for k > 0, k integer, P(z; A, —kA) = (I — zA) k ~ l as proved 
in Theorem 13.11 

• If A = 0, then D(z; 0) = z and P(z; 0, B) = e zB . 

We shall further examine this last case in the remainder of the paper. 
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4 PH-Poisson distributions 



4.1 Definition and comparison to PH distributions 

We restrict our attention to distributions for which A = 0, with the added 
constraint that (3 > 0, B > 0. The assumption that (3 and B are non- 
negative makes it easier to ascertain that T>(f3, 0, B) is the representation 



of a probability distribution. In addition, as we show in Theorem AA, it 
provides us with a physical interpretation in terms of a Markovian process, 
which explains why we call these Phase-type Poisson distributions, or PH- 
Poisson for short. 

Definition 4.1 A random variable X has a PH-Poisson distribution with 
representation V(f3, B) if 

P[X = n] = p n = — { (3B n l, forn>0, (18) 

where B > is a matrix of order m and (3 > is a row-vector of size m 
such that (3e B l = 1. 

Note that (31 < 1, unless B = 0. In the notation of Wu and Li (2010), 
the PH-Poisson distribution with representation V(f3, B) belongs to the gen- 
eralized (a, b, 0) family, with A = 0, B = B, Pq = e~ B and 7 = (3e B . The 
generating function p(z) = Yl n >o zU Pn ^ s gi ven by p( z ) = (3e zB l and the 
factorial moments by 

E[X(X - 1) ■ ■ ■ (X - n + 1)] = (3B n e B l. (19) 

It is easy to see that PH and PH-Poisson distributions are essentially two 
different families of probability distributions. Indeed, assume that X is 
PH-Poisson with representation V(f3, B) and Y is PH with representation 



PH(ck,T). From Q, it results that 

P[X = n] w (sp(B)) n n r /n\, (20) 
asymptotically as n — > 00, where r is the index of sp(B), and similarly, 

P[Y = n ]= ctT n -\l - T)l w (sp(T)) n n s , (21) 
where s is the index of sp(T). It is obvious that for any given B there is no T 



such that the right-hand sides of (20) and (21) coincide for all n big enough, 
unless sp(.B) = sp(T) = 0. 

If sp(.B) = 0, then there exists k < m such that B k = 0, the distribution 
of X is concentrated on {0, 1, . . . , k}, and X does have a PH representation 
by Theorem 2.6.5 in Latouche and Ramaswami (1999). 
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Figure 1: Density function for a PH-Poisson distribution with m — 10 and 
mean 18.71 (curve marked with *), the Poisson distribution with the same 
mean (marked with +) and the minimal-variance PH distribution with the 
same order and the same mean (marked with o). 



Example 4.2 Tail of the density. We see from (20) that the density of 
PH-Poisson distributions drops sharply to zero. We compare three different 
densities on Figure [TJ one is a PH-Poisson distribution, the second a PH 
distribution and the third a Poisson distribution. We have connected the 
points of the densities for better visual appearance, and we plot on the right- 
hand side the tail of the densities in semi- logarithmic scale. 

The curve marked with a "*" is the density of the PH-Poisson distribution 
with m = 10, Ba — 10, 1 < i < m, and -B^+i = 37.5, 1 < i < m — 1. The 
vector f3 is given by 

P = [1 • • • 0] (diag(e B l))- 1 

(it is easy to verify that f3e B l = 1.) Its mean fi, variance a 2 and coefficient 
of variation C.V. equal to o / ji are given in the first row of Table [Tj as well 
as the spectral radius S.R. of the matrix B. 

The curve marked with a "+" is the density of the PH distributions with 
the same order m and mean /i and minimal variance (see Telek (2000) for 
details). The curve marked with a "o" is the Poisson density with parameter 
equal to the mean /i. The variance, coefficient of variation, and spectral 
radius of these two densities are also given in Table [TJ 

The plot on the right-hand side of Figure [T] clearly indicates that the PH- 
Poisson density decays asymptotically the fastest of the three, this is due to 
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a 2 


C.V. 


S.R. 


PH-Poisson 


18.71 


10.35 


0.17 


10 


Phase-type 


U 


16.30 


0.22 


0.47 


Poisson 


U 


18.71 


0.23 


18.71 



Table 1: Mean, variance and coefficient of variation of the three distributions 
of Figure [TJ 





n 


a 2 


C.V. 


S.R. 


PH-Poisson 




37.71 


73.89 


1.96 


42 


Phase-type, m = 


= 10 


it 


104.5 


2.77 


0.73 


Phase-type, m = 


= 13 


it 


71.69 


1.90 


0.66 



Table 2: Mean, variance and coefficient of variation of the three distributions 
of Figure [2j 

the combination of a relatively small spectral radius and of the factor l/n\. 
We also see from the plot on the left-hand side, and from Table [TJ that it is 
the most concentrated around the mean. 

Example 4.3 Small variance. We pursue here the comparison between 
PH-Poisson distributions and minimal variance PH distributions, showing 
that PH-Poisson distributions may prove to be a useful alternative to PH 
distributions when modeling discrete distributions with small variance. 

The curve marked with a "*" on Figure [2] is the density of the PH-Poisson 
distribution with m = 10, Bu = 2 + 4i, 1 < i < m, and -B^j+i = 0.5, 
1 < i < m — 1. The vector f3 is given by 

/3 = -[l 1 ■■■ 1] (diag(e B l))- 1 . 
m 

Its mean, variance, coefficient of variation and spectral radius are given in 
Table 1 

The two other curves are the density functions of PH distributions with 
minimal variance, with the same mean as the PH-Poisson distributions, and 
with different orders. The one marked with "+" has the same order m = 10 
as the PH-Poisson distribution, the one marked with "o" has order m = 13, 
the smallest value for which the minimal variance is smaller than that of the 
PH-Poisson distribution. 

4.2 A physical interpretation 

We now give a physical interpretation for PH-Poisson distributions. First, we 
define the Poisson process 6*2, . . . } of rate v = maxj (-Bl)j. Second, we 
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0.05 




10 20 30 40 50 60 70 80 90 



Figure 2: Density function for a PH-Poisson distribution with m = 5 (curve 
marked with *), a minimal- variance PH distribution with m = 5 (marked 
with +) and a minimal-variance PH distribution with m — 17 (marked with 
o). The distributions all have the same mean /i = 34.00. 



define P = u^ 1 B; P is a sub-stochastic matrix, possibly stochastic. Next, we 
consider a discrete PH random variable K with representation (a, P), where 
a. = c/3 for some arbitrary but fixed constant c < (/31) _1 . In the present 
description, the Markov chain with transition matrix P makes a transition 
at each event of the Poisson process and it gets absorbed at time T = 8k- 
Finally, we count the number N(t) of transitions between transient states 
until the Markov chain enters its absorbing state; that is, N(t) is the number 
of Poisson events in the interval (0, t) for t < T and N(t) = K — 1 for t > T. 

Theorem 4.4 If a — c/3 for some arbitrary but fixed constant c < (/31) _1 , 
then p n defined in (18) is the conditional probability 

Pn = P[N{1) = n\T > 1]. (22) 

Proof Define M k {t) such that 

(M fc (t))y = P[N(t) = k,<p(t) = j\<p{0) = i], for 1 < i,j < m. 

One easily verifies that M (t) = e~ ut I and one proves by induction that 

M k (t) = e- u \vP)H k /k\ = e" ut B k t k /k\, (23) 



for k > 1. Equation (23) holds for k = and we assume that it holds for 
some k — 1. Conditioning on the epoch u of the first Poisson event, we find 
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that 



M k {t)= [ e~ vu vPM k ^{t-u) du 
Jo 

= [ e-^Be-^-^B^it-u^/ik-iy. du 
Jo 

= e~ vt B k /{k-l)\ [ {t-uf- 1 du 
Jo 

= e~ vt B k t k /k\. 
Taking t = 1, we find that 

P[JV(1) = *,T>1]= J2 E aJ>[N{t) = k,<p{t) = j\<P(P) = $ 

l<i<M l<j<M 

= ae- v B k /k\l, 

so that P[T > 1] = cte~ u e B l, and 

P[iV(l) = k\T > 1] = {cxe- v e B l)- l ote- v B k /k\l 
= (ae B l)- 1 aB h /k\l. 

If a = c/3 for any scalar c < (/31) _1 , then P[iV(l) = k\T > 1] = p k for all k. 
This concludes the proof. □ 



Remark 4.5 If P is stochastic, then the random variable K has an unusual 
PH distribution, as it is either equal to zero or to infinity. Still, the argument 
in the proof of Theorem 4^4 holds true. Note that if P is stochastic, then 
Bl = v\ and the distribution (18) is Poisson with parameter v. 



Example 4.6 This is a PH-Poisson distribution chosen to illustrate the com- 
bined effect of the conditional distribution imposed on the number of transi- 
tions among the phases. The representation is (/3, B) with 



B 



and 



5 .05 

.05 9 .05 

.05 13 .05 

.05 17 .05 

.05 21 



(24) 



(3 = 1 [5. 2.5 3. 2.25 6.] exp{-diag(5, 9, 13, 17, 21)} 



(25) 
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0.04 
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0.01 




5 10 15 20 25 30 35 40 45 



Figure 3: Density function for a PH-Poisson distribution with representation 



(J3,B) given in (24, 25) 



where the scaling factor 7 is such that f3e B l = 1. Its first two moments 
are /x = 13.84 and a 2 = 47.31, and its density is given on Figure |3j The 
phase-type representation is (u; a, P) with v = 21.05, 



and 



a 



0.2375 0.0024 

0.0024 0.4276 0.0024 

0.0024 0.6176 

0.0024 



1-2 n on 1 n— 3 






0.0024 
0.8076 







0.0024 



0.0024 0.9976 



(26) 



[0.99 0.91 10- 2 0.20 10" 3 0.27 KT 5 0.1310~ 6 ], 

which is the vector (3 normalized so that al = 1. 

Denote by N* = sup{fc : 9k < 1} the total number of Poisson events in 
(0, 1). If T < 1, then N(l) = K — 1 < N*, if T > 1, then N(l) = N* < K. 
On the average, the Poisson process produces v events in the interval (0, 1), 
and the Poisson distribution has a relatively small standard deviation, so 
one expects iV* to take values close to v ~ 21. The matrix P in (26) is 



irreducible, albeit with a small probability of migration from one phase to 
another, so that the initial phase plays a significant role in the distribution 
of K. 

If the initial phase is 1, then the absorption probability is about 0.76, and 
it is likely that K will be small; it is therefore necessary, for the condition 
T > 1 to be fulfilled, that the Poisson process produces few events in (0, 1). 
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On the other hand, if the initial phase is 5, then the PH Markov chain 
will remain in that phase for a large number of transitions, it is likely that 
K will be large, so that T is likely to be much larger than 1, and it is not 
expected that the condition [T > 1] puts much constraint on iV*. 



5 EM algorithm 

In this section, we exploit the probabilistic interpretation of PH-Poisson dis- 



tributions given in Section 4J2, and we develop an EM algorithm for fitting 
PH-Poisson distributions into data samples. 

The EM algorithm is a popular iterative method in statistics for comput- 
ing maximum-likelihood estimates from data that is considered incomplete. 
The procedure can be explained briefly as follows. Let 6 G £1 be the set 
of parameters to be estimated. We denote by X a random complete data 
sample and by j{X \ 6) its conditional density function, given the parameters 
0. The maximum-likelihood estimator 6 is defined as 

6 = argmaxlog/(Af | 6). 

For one reason or another, instead of observing the complete data sample X, 
we observe an incomplete data sample y. Thus, X can be replaced by its 
sufficient statistic (y, Z), where Z is the sufficient statistic of the unobserved 
data. As X is unobservable, instead of maximizing log/(A' | 6) we maximize 
its conditional expectation given the incomplete data sample y = y and the 
current estimates 6^ s \ at each (s + l)th iteration for s > 0. 
The EM algorithm can thus be decomposed into two steps: 

• E-step — computing the conditional expectation of log f(X | 0) given the 
incomplete data sample y and the current estimates 0^ s ' 

Q(0,0^) =E[log f(X\0)\y,6(% 

• M-step — obtaining the next set of estimates by maximizing the 
expected log-likelihood determined in the E-step 

(s+1) = argmaxQ(0,0 (s) ). 

When fitting a PH-Poisson distribution into a data sample, the parameters 
to be estimated are = {u,a,P}. Without loss of generality, we assume 



that al = 1 in the chosen representation. By Theorem 4.4, an observation 
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y can be thought of as the number of Poisson events in the time interval 
[0, 1], given that the transient Markov chain with the transition matrix P 
has not been absorbed at time t — 1. This observation can be considered 
incomplete as it tells us neither the initial phase tp(0) of the Markov chain nor 
how it has evolved during [0, 1]; a complete observation can be represented 
by x = (ipo, </?i,..., <f y ), where (fii is the phase of the Markov chain at the ith 
Poisson event and <f { ^ for alH = 0, . . . , y. The conditional density of the 
complete observation x given 6 is 

v y ~ l 



f(x\0) = ( ae » p ir i -a ipo n^. 



y i=0 

Suppose that the complete data sample x contains n observations, each 
of which is denoted by and includes an incomplete observation y^ k \ for 
k — 1, . . . ,n. Then, the conditional density of x given is 

n [fc] n n A/M-l 

f(x | 6) = (ae-l)- II ~Jk~u II II 

k=i y ' k=l k=l \ i=0 

n ,.[fc] m mm 

v y tt rrrr w« 



=(-" p D- , n^n«f , nn 



y[k}\ 11"' LlllPiJ ' 

fc = l ^ ' 1=1 1=1 J = l 



where 

n 

S'j = > l f [fc] .-, for 2 = 1,..., m 
k=l 

is the number of complete observations in x with initial phase i, and 

n 

Ni i = J2J2 1 W W 1 =i, <p W= j} for 2, j = 1, . . . , m 

k=l t>l 

is the total number of jumps in x from phase % to phase j. Thus, the log- 
likelihood function is given by 

n n 

log f{x \0) = -n log(ae^l) + £ log v - ^ logfoN !) 

fc=i fe=i 

m mm 

"X^-logo,--^^ % log Plj . (27) 

i=l i=l j=l 
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Maximum-likelihood estimators To obtain closed-form expressions for 
the maximum-likelihood estimators 6 is not straightforward. Applying the 
Karush-Kuhn- Tucker approach (see Chapter 12 in Nocedal and Wright (2000)), 
it can be verified that the maximization problem 

max log f{x | 0) 

subject to 



al = 1, PI < 1, v > 0, Pij,ca>0 for i,j = 1, . . . , m, 
has the associated Lagrangian 



2m+l 



C(0,\,n) = \ogf(x | 6) - \h(0) - K9i(0), 

i=l 

where 

fe(0) = al - 1, 

m 

^(0) = 1 - X) Pij for z = 1, . . . , m, 

j'=i 

= aj_ m for i — m + 1, . . . , 2m, 

= i/ for i = 2m + 1 , 

and A and = (pi, . . . ,fi2m+i) > denote the Lagrangian multipliers as- 
sociated with the equality constraint h{0) = 0, the inequality constraints 
9i(9) > for i = 1, . . . , 2m and g2m+i(0) > 0, respectively. The KKT condi- 
tions, which are first-order necessary conditions for constrained optimization 
problems, imply that the maximum-likelihood estimators = (£>, a, P) must 
satisfy the following constraints 

ae* p (fiPl - ^ k=lV 1) = (28) 
n 

''(^' S ') ; forz = l,...,m, (29) 



Q " L Q 

TIOL I /- .,\r> „ „,r> , 



/ e (--) p e ^e wP d«l-^ <0, (30) 

for i, j = 1, . . . , m, where 7^ = eJe vP l and e< is the column vector of size m 
with the ith component being 1 and all other components being 0. 

Recall from Remark 4.5 that if P is stochastic then the PH-Poisson dis- 
tribution with representation (z/, a, P) is a Poisson distribution with param- 
eter v. In this case, the constraints (28)-(30) simplify considerably: the first 
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implies that v = Ylk=i V^/n, the well-known maximum-likelihood estimator 
for the parameter of a Poisson distribution; the second becomes = Si/n, 
the maximum-likelihood estimator for the initial vector of a discrete PH dis- 
tribution (see Asmussen et al. (1996)); and the third reduces to 

na f e^-'W,-^ < 0, 

J0 Pij 

or, equivalently, 

vpijcx e 0{p - I)x dxei - <0, (31) 
Jo n 

for i,j = 1, . . . , m. As P is stochastic, summing the left-hand side of (31) 
over i and j gives us 

mm 

e^ p ~ I)x dxl -l/n^iVi^y-l/n^ iV^ = 0, 



which implies that (31 ) is an equality for all i, j = 1, . . . , m. 



Conditional expectation Thanks to the linear nature of log f{X \ 0) in 

the unobserved data Z = {Si, Nij : i, j = 1, . . . , m}, the computation of the 
conditional expectation of \ogf(X \ 6^) at the (s + l)th iteration reduces to 
the computation of ~E[Z \ y, 0^]: 

n 

e[£ 4 |i/,0M] = £e[i | y w,eW] 



fc=i 



fc=i 



p[y W = y W 1 



y , L fori = l,...,m, (32) 

4^ ck( s ) (P( s ))^ [ ] 1 



and 



E[A^ | y, 0<'>] = £ J2 VKfl^,^} I ^ * 



Ml 



k=i t=i 



n y 



[*■■] 



EE 



; pfoft = i | »<•>] p^f 1 = j | »<•>, ^fj, = i] p[yw = 1,1*1 1 0<«>, „J* 1 = j] 



fe=l t=l 
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V - ^= l1 ' lVl3 J 1 ' - for i i - 1 m (331 

^ a W(p(.))»Wi torz,j-l,...,m. (ddj 



New estimates In the M-step, we obtain the new estimates #( s+1 ) = 
(z/ s+1 \ cS s+1 \ p( s+1 )) by maximizing the log-likelihood (27) where {5j,iVy : 
z, j = 1, . . . ,p} are replaced by their conditional expectations Ei[S% \ y, 0^] 
and EfiVjj | 0^] evaluated in the E-step. The maximization problem to be 
solved in this step is as follows 

max logf(y,E{S t \0(%E{N tl \0^]\0) 

subject to 

al = 1, PI < 1, v > 0, pij, oti > for i, j — 1, . . . , m. 



We implemented the EM algorithm in MATLAB and experimented with 
samples simulated from different PH-Poisson distributions. Below are the 
results of one such experiment. 

Example 5.1 We used the PH-Poisson distribution (u;a,P) given in Ex- 
ample 4.6| to generate a sample with 1500 observations. The chosen initial 
parameters are i/W = 10, ck (0) = [0.1 0.2 0.4 0.2 O.l] and 

P (0) = diag(0.5,0.3,0.5,0.7,0.1). 

The estimated parameters obtained after 25 iterations of the EM algorithm 
are z/ 25 ) = 20.4290, ct^ = [0.9054 0.060 0.0335 0.0000 0.0010],and 



p(25) 



0.2401 0.0000 0.0000 

0.0000 0.2610 0.0000 

0.0000 0.0000 0.4543 

0.0000 0.0000 0.0000 

0.9939 0.0026 0.0000 



0.0000 0.0000 

0.0000 0.0000 

0.0000 0.0000 

0.9088 0.0000 

0.0000 0.0035 



The Manhattan norm || ■ 1^ of the difference between the true density and 
the empirical data is 0.1109, between the true density and the estimated 
density is 0.1043, and between the empirical data and the estimated density 
is 0.1400. We plot four densities in Figure |4j that for the true PH-Poisson 
distribution, the empirical data, the initial density and the estimated density. 
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Figure 4: Density function for the true PH-Poisson distribution (the dot- 
ted curve), empirical data (the dashed curve), the initial density (the curve 
marked with o) and the estimated density (the continuous curve). 

It is well-known that although the sequence {0 (s) } s >i computed with the EM 
algorithm always converges, it does not always converge to the maximum- 
likelihood estimator 6, but possibly to some local maximum or stationary 
value of log f(X\6). The warranty of global convergence for the EM algorithm 
depends on properties of the conditional density of the incomplete data y 
given 0, and sometimes also on the starting point 0^°\ We refer to Dempster 
et al. (1977) for further details on the EM algorithm, and to Wu (1983) for 
its convergence properties. 

Our experiments were performed using the MATLAB optimization rou- 
tine f mi neon to solve the maximization problem in the M-step. They indi- 
cated that the results were highly sensitive to the choice of 6^°\ When the 
starting point was chosen randomly, we observed that the EM algorithm of- 
ten converged to a Poisson distribution with parameter ^fe=i 2/'*V n ; even if 
this was a rather poor fit for the given sample. Convergence to a good fit 
was obtained with a starting point that either shares the same structure of 
zeros with the true parameters ot and P, or has a strictly positive ot^ and 
a diagonal matrix p(°' — a mixture of Poisson distributions. 

The latter choice is obviously more practical when the structure of the 
true parameters is not known a priori. Empirically, a diagonal P^ proved 
to be a good starting point even if the true matrix P is not diagonal. Note 
that, unlike its counterpart for fitting discrete Phase- type distributions in 
Asmussen et al. (1996), the EM algorithm for fitting PH-Poisson distribu- 
tions does not necessarily preserve the initial structure. This is due to the 



term -nlog(ae 1) in (27). Consequently, when starting with a diagonal 



P^ ) the EM algorithm does not necessarily converge to a diagonal P. An 
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interesting question for future research is to explain why mixtures of Poisson 
distributions serve as good starting points in the EM algorithm for fitting 
Phase-type Poisson distributions. 
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